Two baselines for unsupervised dependency parsing
Abstract
Results in unsupervised dependency parsing are typically compared to branching baselines and the DMV-EM parser of Klein and Manning (2004). State-of-the-art results are now well beyond these baselines. This paper describes two simple, heuristic baselines that are much harder to beat: a simple, heuristic algorithm recently presented in Søgaard (2012) and a heuristic application of the universal rules presented in Naseem et al. (2010). Our first baseline (RANK) outperforms existing baselines, including PR-DMV (Gillenwater et al., 2010), while relying only on raw text, but all systems submitted to the Pascal Grammar Induction Challenge score better. Our second baseline (RULES), however, outperforms several submitted systems.

1 RANK: a simple heuristic baseline

Our first baseline, RANK, is a simple heuristic baseline that does not rely on part-of-speech information; it assumes only raw text. The intuition behind it is that a dependency structure encodes something related to the relative salience of the words in a sentence (Søgaard, 2012). RANK constructs a graph over the words in a sentence and applies a random walk algorithm to rank the words by salience. The word ranking is then converted into a dependency tree using a simple heuristic algorithm.

The graph over the words in the input sentence is constructed by adding directed edges between the word nodes. The edges are not weighted, but multiple edges between two nodes make transitions between them more likely. The edge template was validated on development data from the English Penn-III treebank (Marcus et al., 1993) and first presented in Søgaard (2012):

• Short edges. To favor short dependencies, we add links between all words and their neighbors. This makes probability mass flow from central words to their neighboring words.

• Function words. We use a keyword extraction algorithm without stop word lists to extract function or non-content words. The algorithm is a crude simplification of TextRank (Mihalcea and Tarau, 2004) that does not rely on linguistic resources, so that we can easily apply it to low-resource languages. Since we do not use stop word lists, highly ranked words will typically be function words. For the 50 most highly ranked words, we add additional links from their neighboring words. This adds additional probability mass to the function words, which is relevant for capturing structures such as prepositional phrases, where the function words take content words as complements.

• Morphological inequality. If two words wi, wj have different prefixes or suffixes, i.e., differ in the first two or the last three letters, we add an edge between them.

Given the constructed graph, we rank the nodes using the algorithm in Page and Brin (1998), also known as PageRank. The input to the PageRank algorithm is any directed graph G = 〈E, V〉, and the output is an assignment PR : V → R of a score, also referred to as PageRank, to each node in the graph, reflecting the probability of ending up in that node in a random walk.